Automatic Extraction of New Words from Japanese Texts using Generalized Forward-Backward Search

Author

  • Masaaki Nagata
Abstract

We present a novel new word extraction method from Japanese texts based on expected word frequencies. First, we compute expected word frequencies from Japanese texts using a robust stochastic N-best word segmenter. We then extract new words by filtering out erroneous word hypotheses whose expected word frequencies are lower than the predefined threshold. The method is derived from an approximation of the generalized version of the Forward-Backward algorithm. When the Japanese word segmenter is trained on a 4.7 million word segmented corpus and tested on 1000 sentences whose out-of-vocabulary rate is 2.1%, the accuracy of the new word extraction method is 43.7% recall and 52.3% precision.
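
The two steps described above (estimating expected word frequencies, then thresholding) can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the N-best segmenter returns, for each sentence, a list of (word list, score) hypotheses, and it assumes the expected count of a word is approximated by normalizing the scores over those N hypotheses only. The function names, the toy tokens, the lexicon, and the threshold value are all hypothetical.

```python
from collections import defaultdict


def expected_word_frequencies(nbest_per_sentence):
    """Approximate expected word counts from N-best segmentations.

    nbest_per_sentence: one entry per sentence, each a list of
    (words, score) pairs, where score is an unnormalized probability.
    Assumption: normalizing over the N-best list stands in for the
    sum over all segmentations.
    """
    expected = defaultdict(float)
    for hypotheses in nbest_per_sentence:
        total = sum(score for _, score in hypotheses)
        if total <= 0:
            continue  # skip sentences with no usable hypothesis
        for words, score in hypotheses:
            posterior = score / total
            for word in words:
                expected[word] += posterior
    return dict(expected)


def extract_new_words(expected, lexicon, threshold):
    """Keep out-of-lexicon words whose expected frequency reaches the threshold."""
    return {
        word: freq
        for word, freq in expected.items()
        if word not in lexicon and freq >= threshold
    }


# Toy data (hypothetical): two sentences, two hypotheses each.
toy_nbest = [
    [(["a", "bc"], 0.6), (["a", "b", "c"], 0.4)],
    [(["bc", "d"], 0.9), (["b", "c", "d"], 0.1)],
]
toy_lexicon = {"a", "d"}

toy_expected = expected_word_frequencies(toy_nbest)
toy_new_words = extract_new_words(toy_expected, toy_lexicon, threshold=1.0)
print(sorted(toy_new_words))
```

In this toy run the hypothesis "bc" accumulates more expected mass than the competing fragments "b" and "c", so it alone survives the threshold. A higher threshold would be expected to favor precision and a lower one recall.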


Similar resources

Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation

Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...


A Stochastic Japanese Morphological Analyzer Using a Forward-DP Backward-A* N-Best Search Algorithm

We present a novel method for segmenting the input sentence into words and assigning parts of speech to the words. It consists of a statistical language model and an efficient two-pass N-best search algorithm. The algorithm does not require delimiters between words. Thus it is suitable for written Japanese. The proposed Japanese morphological analyzer achieved 95.1% recall and 94.6% precisio...


Symmetric Statistical Translation Models for Automatic Image Annotation

Automatic image annotation provides means for users to search image collections on the semantic level using natural language queries. In the past, statistical machine translation models have been successfully applied to automatic image annotation. A problem with this approach is that, due to the skewed distribution of term frequency for annotation words, common words have been overly favored, w...


Extracting Concepts from Dynamic Legislative Text Collections

Selecting discriminating terms in order to represent the contents of texts is a critical problem for many applications in Information Retrieval. Most of the Information Retrieval systems index documents based on individual words that are not specific enough to evidence the contents of texts. As a consequence, there has been a growing interest in developing techniques for automatic term extracti...


Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...




Publication date: 1996